Lab 7 Assignment
Problem 1: Discuss the following examples with your classmates; explain each case and comment on the results.
a. Throwing Dice as a Multinomial Distribution
A multinomial distribution is a distribution that shows the likelihood of the possible results of an experiment with repeated trials, where each trial can result in one of a specified number of outcomes greater than two. A multinomial distribution could describe the results of rolling a die, because a die can land on one of six possible values. By contrast, the results of a coin toss would be described by a binomial distribution, because each toss has only two possible results, heads or tails.
Two additional key characteristics of a multinomial distribution are that the trials it illustrates must be independent (e.g., in the dice experiment, rolling a five does not have any impact on the number that will be rolled next) and the probability of each possible result must be constant (e.g., on each roll, there is a one in six chance of any number on the die coming up).
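These properties can be checked numerically with R's dmultinom. For instance, the probability that six independent rolls of a fair die show each face exactly once is 6!/6^6 ≈ 0.0154 (a small base-R check, not part of the original lab code):

```r
# P(each face appears exactly once in 6 rolls of a fair die)
p <- dmultinom(x = rep(1, 6), prob = rep(1/6, 6))
p                    # equals 6! / 6^6
factorial(6) / 6^6   # manual check of the same quantity
```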
b. Rolling a die N=100 times
one.dice <- function(){
  # draw a single value uniformly from 1..6
  dice <- sample(1:6, size = 1)
  return(dice)
}
one.dice() # what is happening here? try this several times.
## [1] 6
What is happening here?
The function is simulating the throw of a die. Each time we run the function, we should be getting a value from 1 to 6 with equal probability.
#what is happening here?
par(mfrow = c(2, 2))
for (i in 1:4){
  sims <- replicate(100, one.dice())
  print(table(sims))                 # counts per face (print() needed inside a loop)
  print(table(sims) / length(sims))  # relative frequencies
  plot(table(sims), xlab = 'Event', ylab = 'Frequency')
}
What is happening here?
It looks like we’re replicating 100 die throws 4 times and plotting the resulting counts for each simulation. Theoretically, we would expect all bars to be roughly the same height, but that is only achievable with more than 100 simulations.
c. Rolling a die N=10000 times.
#what is happening here?
par(mfrow = c(2, 2))
for (i in 1:4){
  sims <- replicate(10000, one.dice())
  print(table(sims))
  print(table(sims) / length(sims))
  plot(table(sims), xlab = 'Event', ylab = 'Frequency')
}
What is happening here?
Per my previous comment, I expected a larger simulation to bring the bars closer to equal heights. Since these four simulations each have 10000 throws, the counts for all six faces get much closer to equal.
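The shrinking bar-height differences can be quantified. A small sketch (base R; max_dev is my own helper, not from the lab) comparing the largest deviation of the observed face frequencies from the expected 1/6 at the two sample sizes:

```r
set.seed(7)
# largest absolute deviation of observed face frequencies from 1/6
max_dev <- function(N) {
  freqs <- table(factor(sample(1:6, N, replace = TRUE), levels = 1:6)) / N
  max(abs(freqs - 1/6))
}
c(N_100 = max_dev(100), N_10000 = max_dev(10000))
```

By the law of large numbers the deviation shrinks roughly like 1/sqrt(N), so the N = 10000 runs sit much closer to the uniform line.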
Problem 2. Multinomial distribution and its Marginals
From the class example

Let’s say that Molly, Ryan and Mr. Bob are buying beer (x1), bread (x2) and coke (x3) with probabilities (3/5, 1/5, 1/5).
- What is the probability that only 1 of them will buy beer, 2 of them will buy bread, and none will buy coke? Compare the result with the theoretical probability.
print(3 * (3/5) * (.2^2))
## [1] 0.072
print(dmultinom(x = c(1, 2, 0), prob = c(.6, .2, .2)))
## [1] 0.072
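The leading factor of 3 in the manual calculation is the multinomial coefficient 3!/(1! 2! 0!). Spelling the full pmf out (a small check, not part of the original lab code):

```r
n <- 3
counts <- c(1, 2, 0)
probs <- c(0.6, 0.2, 0.2)
coef <- factorial(n) / prod(factorial(counts))  # 3!/(1! 2! 0!) = 3
coef * prod(probs ^ counts)
## [1] 0.072
```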
- Do a simulation for this scenario and plot the marginal distribution of x1.
my_prob <- c(0.6, 0.2, 0.2)
n_experiments <- 10000   # renamed so the count is not overwritten by the matrix below
samples <- 3
experiments <- rmultinom(n = n_experiments, size = samples, prob = my_prob)
dft <- t(data.frame(experiments))
colnames(dft) <- c('Beer', 'Bread', 'Coke')
rownames(dft) <- NULL
dft2 <- as.data.frame(dft)
# empirical probability of exactly 1 beer and 2 bread (hence 0 coke)
length(dft2[dft2$Beer == 1 & dft2$Bread == 2, ]$Beer) / n_experiments
## [1] 0.071
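The problem also asks to plot the marginal distribution of x1 (beer). The marginal of a single component of a multinomial is binomial, here Binomial(3, 0.6). A minimal, self-contained base-R sketch (re-simulating rather than reusing the data frame above):

```r
set.seed(1)
sims <- rmultinom(n = 10000, size = 3, prob = c(0.6, 0.2, 0.2))
beer <- sims[1, ]  # marginal counts of x1 (beer) per experiment

# empirical relative frequencies vs. the theoretical Binomial(3, 0.6) pmf
empirical  <- as.numeric(table(factor(beer, levels = 0:3))) / length(beer)
theoretical <- dbinom(0:3, size = 3, prob = 0.6)
plot(0:3, empirical, type = 'h',
     xlab = 'Number buying beer', ylab = 'Relative frequency')
points(0:3, theoretical, col = 'red', pch = 19)  # theoretical pmf in red
```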
Problem 3: Discuss this with your classmates and comment on the plots. What can you observe from each plot?
Helpful links to answer this question:
-> A contour plot also shows the densities.
https://blog.revolutionanalytics.com/2016/02/multivariate_data_with_r.html
-> Then we have the ellipses: the circularly symmetric version of the complex normal distribution. https://en.wikipedia.org/wiki/Elliptical_distribution
ellipse: https://en.wikipedia.org/wiki/Ellipse
-> http://cs229.stanford.edu/section/gaussians.pdf
-> This explains how the spread of the distribution and the shape of the ellipses change as the correlation coefficient increases: https://online.stat.psu.edu/stat505/book/export/html/636
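For the covariance matrix used in this problem, the implied correlation coefficient can be computed directly with base R's cov2cor (a small check; sigma is the same matrix defined in the code below):

```r
sigma <- matrix(c(4, 2, 2, 3), ncol = 2)
cov2cor(sigma)  # converts the covariance matrix to a correlation matrix
# off-diagonal: 2 / sqrt(4 * 3) ~= 0.577, a moderate positive correlation,
# which is why the density ellipses below tilt along the X1 = X2 direction
```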
library(tidyverse)
library(mvtnorm)
library(plotly)
library(MASS)
library(ggplot2)
Source: https://data-se.netlify.app/2018/12/13/visualizing-a-multivariate-normal-distribution/
Simulate multivariate normal data
First, let’s define a covariance matrix Σ:
sigma <- matrix(c(4,2,2,3), ncol = 2)
sigma
##      [,1] [,2]
## [1,]    4    2
## [2,]    2    3
Then, simulate n observations from this covariance matrix; the means must be defined, too. As our covariance matrix is 2 × 2, we need two means:
means <- c(0, 0)
n <- 1000
set.seed(42)
x <- rmvnorm(n = n, mean = means, sigma = sigma)
str(x)
## num [1:1000, 1:2] 2.314 1.053 0.716 2.848 3.839 ...
head(x)
##          [,1]        [,2]
## [1,] 2.3139150 -0.15442375
## [2,] 1.0527522 1.24094662
## [3,] 0.7162789 0.05340542
## [4,] 2.8479495 0.69465309
## [5,] 3.8388378 1.03195246
## [6,] 3.7900042 4.47972608
You can see that x is bivariate normally distributed.
Let’s make a data frame out of it:
d <- data.frame(x)
names(d)
## [1] "X1" "X2"
a. Plotting univariate (sampled) normal data
## marginal of X1
d %>%
  ggplot(aes(x = X1)) +
  geom_density()
What is happening here?
The plot shows the marginal distribution of X1, i.e., the density of X1 alone, ignoring X2.
b. Plot theoretic normal curve and compare with the above marginal distribution of X1.
p1 <- tibble(x = -3:3) %>%   # data_frame() is deprecated; tibble() is the replacement
  ggplot(aes(x = x)) +
  stat_function(fun = dnorm, n = n)
p1
What is happening here?
This plot draws a standard normal curve for comparison with X1’s marginal distribution. The shapes are similar, but note that the theoretical marginal of X1 is actually N(0, 4) (standard deviation 2, the first diagonal entry of Σ), so X1’s marginal is wider than this standard normal curve.
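For a like-for-like comparison, the exact marginal N(0, sd = 2) can be overlaid on the sampled density of X1. A self-contained sketch using MASS::mvrnorm (MASS ships with standard R installs; this re-simulates rather than reusing d from above):

```r
library(MASS)  # for mvrnorm

set.seed(42)
sigma <- matrix(c(4, 2, 2, 3), ncol = 2)
d <- data.frame(mvrnorm(n = 1000, mu = c(0, 0), Sigma = sigma))
names(d) <- c('X1', 'X2')

# sampled marginal density of X1 with the exact N(0, sd = 2) curve overlaid
plot(density(d$X1), main = 'Marginal of X1 vs N(0, 4)')
curve(dnorm(x, mean = 0, sd = 2), add = TRUE, lty = 2)
```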
Plotting multivariate data
c. 2D density
p2 <- ggplot(d, aes(x = X1, y = X2)) +
  geom_point(alpha = .5) +
  geom_density_2d()
p2
What is happening here?
We’re used to seeing a density function laid over a histogram, but that only works with univariate data. Since we have two variables here, the plot moves to a higher dimension. This plot shows a bivariate normal distribution: a high concentration of points around the middle, with increasingly sparse areas as we move away from the center.
e. Contour plot
Geom binhex https://ggplot2.tidyverse.org/reference/geom_hex.html
p3 <- ggplot(d, aes(x = X1, y = X2)) +
  geom_point(alpha = .5) +
  geom_bin2d() +
  scale_fill_viridis_c()
p3
What is happening here?
This is essentially the same information as the previous plot, presented differently: the distribution looks normal, with higher counts near the middle and lower counts toward the outside.
f. 2D scatter plot and heatmap with plotly
(p <- plot_ly(d, x = ~X1, y = ~X2))
What is happening here?
Again, similar to the other plots, we’re seeing the joint distribution of X1 and X2. The points cluster near the center: dense in the middle and sparse at the edges. The edges are the equivalent of the tails in a univariate plot.
add_histogram2d(p)
What is happening here?
More of the same, plotted differently: yellow marks high counts and blue marks sparse regions.
g. 2D contour with plotly
add_histogram2dcontour(p)
What is happening here?
Although this one shows the same information as the other plots, it is a bit more interesting: the contour levels add a quasi-third dimension, so we can begin to see the height of the distribution and locate the peak, where most of the events occur.
h. 3D plot: Surface
dens <- kde2d(d$X1, d$X2)
plot_ly(x = dens$x,
        y = dens$y,
        z = dens$z) %>% add_surface()
What is happening here?
This plot is pretty cool. It helps to visualize the densities in 3D. The color legend marks the lighter colors as higher peaks, which becomes apparent once we rotate the view of the plot.
i. 3D Scatter
First, compute the density of each (X1, X2) pair.
d$dens <- dmvnorm(x = d, sigma = sigma)  # pass the same Σ the data came from; the default is the identity matrix
Now plot a point for each (X1, X2, dens) tuple.
p4 <- plot_ly(d, x = ~ X1, y = ~ X2, z = ~ dens,
              marker = list(color = ~ dens,
                            showscale = TRUE)) %>%
  add_markers()
p4
What is happening here?
It seems we’re plotting the density at each sampled point. Once again, most of the mass is centered where both variables are near 0, and as you move away from the center there are fewer points. This plot is also consistent with a bivariate normal distribution.
Problem 4: Topic Modeling (No need to submit)
Try topic modeling on the HPCorpus, where I have included the Harry Potter texts and the Lord of the Rings texts.
This article explains Topic Modeling in R very clearly (please follow it) https://www.tidytextmining.com/topicmodeling.html
Also, please follow this article on stop words: https://smltar.com/stopwords.html